PLOS Digital Health — Latest Matching Preprints

1

PhysiCase: Development and dual-layer validation of synthetic cases for health professional education: A pilot study leveraging Generative AI

Komolafe, O. O.; Roberts, A. C.; Shelley, J.; Tawiah, A. K.

2026-06-09 rehabilitation medicine and physical therapy 10.64898/2026.06.07.26355114 medRxiv

Top 0.1%

27.0%

High-quality, domain-specific datasets are foundational to advancing educational tools and AI systems in healthcare, yet assembling case repositories from real-world clinical records faces substantial privacy, ethical, and licensing barriers. Synthetic data generation offers a compelling pathway forward, but educational cases require rigorous validation to ensure clinical plausibility and pedagogical utility. This pilot study introduces PhysiCase, a dual-layer validation pipeline for synthetic case generation, and evaluates the feasibility of combining automated LLM-based screening with expert educator review. We generated 128 synthetic musculoskeletal (MSK) cases using four frontier large language models (GPT-4.1, GPT-4o, Google Gemini 2.5 Pro, and Llama 4 Scout) across 28 clinical conditions. Cases underwent automated quality screening using an "LLM-as-judge" framework (DeepEval) assessing prompt alignment, JSON correctness, answer relevance, bias, toxicity, and completeness. Ninety cases (70.3%) passed automated filtering and proceeded to expert evaluation by four MSK physiotherapy educators, who rated medical accuracy, realism, fidelity, relevance, and usability on 5-point Likert scales. GPT-4.1 demonstrated the highest automated pass rate (96%) and strongest expert ratings (medical accuracy 4.10/5, usability 4.38/5), while Llama 4 Scout showed the lowest pass rate (33.3%) and the lowest expert ratings. Expert-evaluated cases achieved strong content validity indices for usability (97.5%), relevance (97.5%), and realism (95%), though medical accuracy showed greater variance (CVI 87.5%). Cross-layer correlation analysis revealed that automated completeness metrics moderately aligned with expert usability ratings, while answer relevance and prompt alignment showed weak or negative correlations with clinical correctness. Qualitative analysis identified three primary failure modes: reductive logic, biomechanical inconsistency, and administrative/contextual gaps.
The dual-layer validation framework proved methodologically viable: automated screening efficiently reduced expert review burden, while human judgment remained indispensable for detecting subtle clinical reasoning failures. LLM-generated synthetic cases have the potential to meet practical educational needs for MSK physiotherapy, but expert validation is essential to safeguard clinical accuracy. These findings support a scalable division of labour for synthetic case development, with targeted improvements to prompting and automated reasoning checks needed to address identified "nuance gaps." The code for this paper is available at https://github.com/kwid-ai/PhysiCase

2

AI Adoption for NCDs in Kenya: A Qualitative Study

Rayo, J.; Cushny, W.; Mwangi, M.; Wanyee, S.; Linguraru, M. G.; Nyaga, N.; Koros, H.; Bosire, M.; Obuya, M.; Ngaruiya, C.

2026-05-27 public and global health 10.64898/2026.05.26.26354008 medRxiv

Top 0.1%

25.5%

Background: Non-communicable diseases (NCDs) represent a critical public health challenge in Kenya, responsible for over 50% of inpatient admissions and 40% of deaths. While digital health tools and artificial intelligence offer promising ways to improve prevention, diagnosis, and management, little is known about how these tools are perceived and used in practice. There is limited research exploring the views and lived experiences of young people in Kenya, who are a strategic priority for NCD prevention because behavioral risk factors are established in this window, and of Community Health Providers (CHPs), who provide health services within the community. This study aims to address this gap by examining perspectives on the burden of non-communicable diseases and the potential role of digital health technologies, including artificial intelligence, for preventing and managing these conditions in these specific populations. Methods: A qualitative research design using focus group discussions (FGDs) was employed in Nairobi (urban) and Busia (rural) counties between March and July 2024. Eight FGDs were conducted with 60 participants purposively sampled from three stakeholder groups: community health promoters (CHPs), healthcare workers (HCWs), and youth aged 18-35 years. A semi-structured guide, co-developed with a Community Advisory Board, explored beliefs about NCDs, health-seeking behaviors, lifestyle practices, and attitudes toward digital health and AI. Audio recordings were transcribed verbatim, translated where necessary, and analyzed thematically using grounded theory principles in NVivo software (v12). Results: Six consolidated themes emerged: (1) understanding of NCDs and perceived risk; (2) barriers to NCD prevention and care; (3) the role of CHPs; (4) adoption of AI tools for NCD management; (5) trust, ethics and access concerns; and (6) community-driven recommendations for AI integration.
Significant barriers including stigma, economic constraints, and barriers to care were documented alongside enthusiasm for AI tools among youth and CHPs in both urban and rural areas. Conclusion: This study shows that AI tools are being used for NCD prevention and management through spontaneous community adoption. However, it emphasizes the need for culturally relevant, equitable, and community-driven solutions. Effective scaling requires the identification and bridging of digital literacy gaps, the establishment of affordable infrastructure, the protection of data privacy, and the integration of artificial intelligence tools into existing community health frameworks. This process should involve the collaboration of trusted intermediaries, such as CHPs and community leaders, to ensure successful outcomes. Future initiatives should prioritize participatory design, policy frameworks for ethical governance, and targeted capacity building to enhance acceptance and sustainability of digital health innovations in low- and middle-income country settings.

3

Technology acceptance of machine learning in life sciences: the role of hype perception and journal impact factor.

Serrano, A. E.

2026-06-09 health informatics 10.64898/2026.06.03.26354262 medRxiv

Top 0.1%

22.9%

Machine learning (ML) has emerged as a transformative technology across biomedical and life science sectors, with applications spanning drug discovery, medical imaging, genomics, and clinical decision support (Goecks et al., 2020; Patel et al., 2020). Despite exponential growth in ML-related publications, from fewer than 100 articles in 2003 to nearly 25,000 by 2021 (NCBI, 2022), adoption among industry professionals remains uneven and sector-dependent. Understanding what drives or inhibits this adoption is critical for organisations seeking to leverage ML capabilities in research and clinical practice. Technology adoption in organisational contexts has been extensively studied through the Technology Acceptance Model (TAM), originally proposed by Davis (1989) and subsequently extended to incorporate external variables influencing perceived usefulness (PU) and perceived ease of use (PEU) (Venkatesh & Davis, 1996). While TAM has been applied across multiple industries, its application within biomedical and life science contexts remains limited, and the industry-specific factors that shape ML acceptance in this sector have not been systematically examined. Two external variables are particularly relevant to life science professionals. First, the bibliometric journal impact factor (JIF) functions as a cognitive signal of scientific credibility in a sector where evidence-based decision-making is culturally embedded and publication quality serves as a proxy for technological legitimacy (Garfield, 1996). Second, technology hype, operationalised through the Gartner Hype Cycle framework, represents a social influence variable that shapes organisational expectations and investment decisions around emerging technologies (Gartner Inc., 2018). Whether these variables influence ML acceptance among life science professionals, alongside individual knowledge and experience, has not been empirically tested.
This study addresses that gap by investigating ML technology acceptance among 213 biomedical and life science professionals across EMEA, LATAM, and North America, using a cross-sectional quantitative survey and PLS-SEM analysis. The TAM model is extended with three external variables, JIF, technology hype, and prior knowledge and experience, to test their influence on PU and PEU in this specific professional context. Additionally, the study examines demographic and regional differences in ML acceptance, with particular attention to variation between academic researchers and healthcare professionals. The findings contribute a validated, sector-specific extension of TAM for life sciences, provide actionable insights for organisations seeking to accelerate ML implementation, and establish a framework for future subsector-specific research.

4

Daily symptom monitoring is sustainable over months: retention, not compliance, is the primary barrier to long-duration digital tracking

Gunsilius, C. Z.; Pei, P.; Carayannopoulos, A.; Petzschner, F. H.

2026-06-10 rehabilitation medicine and physical therapy 10.64898/2026.06.08.26355180 medRxiv

Top 0.1%

22.4%

Ecological momentary assessment (EMA) enables real-time, longitudinal measurement of symptoms and behavior via smartphones, yet nearly all feasibility evidence comes from protocols lasting one to two weeks, far shorter than the timescales over which chronic diseases fluctuate and clinical decisions unfold. Whether daily compliance can be sustained over months, or whether it decays as short-protocol trends predict, is unknown. Here, 214 participants (173 with pain, 41 healthy controls) completed a 4-month (122-day) EMA protocol via the Soma smartphone app, generating 26,907 check-ins. Half the sample completed the full protocol without a two-week lapse. Aggregate compliance appeared moderate (50%), but this conflated two distinct phenomena: when recomputed over each participant's active period, compliance rose to 71%, with 91% achieving moderate-to-high adherence, and remained stable across all 17 study weeks. Pain status predicted earlier disengagement but not lower compliance among those who remained; after adjustment for differential retention, group differences disappeared. To our knowledge, this is the longest continuous daily EMA evaluation in a clinical population. It suggests the primary barrier to long-duration EMA is not declining motivation among active participants but concentrated early disengagement, with direct implications for the design of digital health protocols, decentralized trials, and remote symptom monitoring.
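
The distinction the abstract draws between aggregate compliance and compliance over each participant's active period can be illustrated with a minimal sketch. The abstract does not say how either figure was computed, so the definitions below are assumptions made for illustration: at most one check-in is counted per day, and the "active period" runs from day 1 to a participant's last check-in day. The function names and toy data are hypothetical.

```python
PROTOCOL_DAYS = 122  # protocol length stated in the abstract


def aggregate_compliance(checkins, protocol_days=PROTOCOL_DAYS):
    """Share of all scheduled participant-days that have a check-in (assumed definition)."""
    completed = sum(len(set(days)) for days in checkins.values())
    return completed / (len(checkins) * protocol_days)


def active_period_compliance(checkins):
    """Mean per-participant compliance from day 1 to the last check-in day (assumed definition)."""
    rates = []
    for days in checkins.values():
        unique_days = set(days)
        if unique_days:  # participants with no check-ins have no active period
            rates.append(len(unique_days) / max(unique_days))
    return sum(rates) / len(rates)


# Toy data (made up): one completer and one participant who stops after day 20.
toy = {
    "p1": list(range(1, PROTOCOL_DAYS + 1)),
    "p2": list(range(1, 21)),
}
aggregate = aggregate_compliance(toy)
active = active_period_compliance(toy)
print(f"aggregate={aggregate:.2f}, active-period={active:.2f}")
```

In this toy case the aggregate figure is pulled down by the early dropout even though both participants checked in every day while active, which is the kind of conflation between retention and compliance that the abstract describes.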

5

Accuracy and Consistency of Frontier LLMs on Orthodontic Diagnostic Tasks: A Repeated-Trial Comparison

Kang, W. J.; Sim, J.; Loh, E. E. M.; Lim, A. C. Y.; Foong, K. W. C.

2026-05-20 health informatics 10.64898/2026.05.17.26353409 medRxiv

Top 0.1%

20.0%

Importance. Large language models are increasingly explored as clinical decision support tools in orthodontics, yet existing evaluations have been confined to knowledge based question answering where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies fail to report the number of repeated queries performed, leaving output stochasticity unexamined. Objective. To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, namely, ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN DHC) classification, space analysis, and lateral cephalometric interpretation. Methods. In this comparative cross-sectional study with a repeated-measures design, each model, accessed through its respective consumer facing web interfaces under default provider settings rather than through application programming interfaces, processed 200 purpose-built items (50 per task) across four independent trials, yielding 2,400 observations. Responses were scored against a pre-established reference standard by two independent raters using strict binary exact match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran's Q test with post-hoc McNemar's tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations). Results. Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4 to 99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6 to 98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8 to 96.9%) (Cochran's Q=6.87, p=0.032). 
Each model exhibited distinct, non-overlapping error profiles concentrated at the normal-abnormal classification boundary. An accuracy-consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context rich prompting eliminated all errors across all three models. Interpretation. Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. An accuracy-consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, the absence of a non-thinking baseline means this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment pending prospective validation on real patient data.
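
The abstract reports accuracy with exact binomial 95% confidence intervals. A minimal sketch of one common exact method, the Clopper-Pearson interval, is given below using only the Python standard library. The abstract does not give the underlying counts, so the example of 198 correct out of 200 is a hypothetical illustration chosen to correspond to 99.0%; it is not a figure taken from the study, and the function names are assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, iterations=100):
    """Bisection for f(p) = target on [0, 1], where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at a larger p
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for x successes in n trials."""
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2
    )
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2
    )
    return lower, upper


# Hypothetical counts: 198 correct of 200 (99.0%), not stated in the abstract.
low, high = clopper_pearson(198, 200)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Under that assumed count, the interval comes out close to the 96.4 to 99.9% range quoted for the most accurate model. This is consistent with the interval having been computed over 200 items, though the abstract does not confirm it.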

6

Pixel-Based Skin Tone Estimation on Dermoscopy: A Dual-Rater MST Benchmark and Feasibility Study

Kumarasinghe, A.; Bui, V.; Ghanbarzadeh, R.

2026-05-17 health informatics 10.64898/2026.05.13.26353004 medRxiv

Top 0.1%

19.2%

Skin-tone labels are absent from public dermoscopy benchmarks such as the International Skin Imaging Collaboration (ISIC), making it impossible to audit whether clinical AI performs equitably across skin tones. While several recent works estimate skin tone automatically from clinical photography and selfies, we ask whether this approach is feasible on dermoscopy, the primary imaging modality of these benchmarks. To answer this, we make three main contributions. First, we release MST-Derm, a dual-rater Monk Skin Tone (MST) annotation benchmark on 500 ISIC 2018 images. Raters were given an explicit unrateable option for crops where the skin surrounding the lesion was too occluded to label confidently. We find that 60% of images were marked unrateable, yielding a 193-image consensus subset (quadratic-weighted Cohen's Kappa = 0.82). Second, we conduct a systematic feasibility study of three pixel-based MST annotation pipelines spanning the principal families in prior work: palette matching in perceptual colour space, robust colour statistics, and projection to a 1D colorimetric scalar. All three pipelines produce ordinal signal above chance (95% confidence intervals on quadratic-weighted Kappa exclude zero). However, ISIC 2018's extreme light-skin bias leaves 82% of the evaluation set at MST 2, giving a constant "always predict MST 2" baseline an accuracy floor the methods cannot overcome. To separate algorithmic signal from dataset bias, we evaluate on a class-balanced subset. The best method reaches quadratic-weighted Kappa = 0.43 against the trivial baseline of Kappa = 0.00, confirming the signal is genuine. Third, we diagnose this performance ceiling. We trace the bottleneck to two causes: dermoscopy's specialised illumination physically compresses the colour range on which lighter skin tones differ, and ISIC's dataset skew makes standard absolute-accuracy metrics uninformative. 
We conclude that while pixel-based colour features carry real MST signal on dermoscopy, current performance is insufficient for autonomous annotation. We release the benchmark, annotation protocol, all prediction runs, and analysis code to facilitate the development of robust skin-tone estimators, a vital prerequisite for accurately auditing fairness and mitigating bias in dermatological machine learning.
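
The observation that a constant "always predict MST 2" baseline scores Kappa = 0.00 follows directly from how quadratic-weighted Cohen's Kappa is defined. The sketch below is a plain-Python implementation of the standard formula, not the authors' released analysis code; the function name and toy labels are hypothetical.

```python
def quadratic_weighted_kappa(labels_a, labels_b, categories):
    """Quadratic-weighted Cohen's Kappa for two equal-length ordinal label lists."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(labels_a)

    # Observed confusion matrix between the two label sources.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        return float("nan")  # undefined when both sources use a single category
    return 1 - weighted_observed / weighted_expected


# Toy labels (made up) on the 10-point Monk Skin Tone scale.
mst_scale = list(range(1, 11))
truth = [2, 2, 2, 3, 1, 2, 4, 2]
constant_baseline = [2] * len(truth)

kappa_constant = quadratic_weighted_kappa(truth, constant_baseline, mst_scale)
kappa_perfect = quadratic_weighted_kappa(truth, truth, mst_scale)
print(kappa_constant, kappa_perfect)
```

For a constant predictor the observed and chance-expected weighted disagreement coincide, so Kappa is zero regardless of how high the raw accuracy is. This is why a chance-corrected ordinal metric is more informative than accuracy on a heavily skewed label distribution.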

7

Rheumatic Heart Disease Detection in Asymptomatic Schoolchildren using ECG and PCG

Chuma, A. T.; Wang, C.; Voigt, J.-U.; Mekonnen, D.; Asmare, M. H.; Vanrumste, B.

2026-05-15 health informatics 10.64898/2026.05.12.26352939 medRxiv

Top 0.1%

17.6%

Rheumatic heart disease (RHD) remains a major public health concern across low- and middle-income countries in the Global South. Early detection through community-based screening of asymptomatic individuals has been identified as a critical strategy for reducing the disease burden. Despite this, the absence of accessible, automated population screening tools continues to impede implementation at scale. This study investigates the screening potential of integrating electrocardiography (ECG) and phonocardiography (PCG) for the early detection of RHD in asymptomatic schoolchildren. The dataset was obtained as part of an ambulatory screening initiative conducted across multiple school sites in rural areas of Ethiopia. It comprised ECG and PCG recordings from 611 asymptomatic schoolchildren aged 10 to 20 years. A comprehensive set of time-frequency, visibility graph and non-linear features was extracted from both signal modalities. These features were subsequently evaluated using machine learning models to assess their utility in the automated screening of early RHD. The best model achieved average 10-fold cross-validation scores for sensitivity, positive predictive value, and F1-score of 59.6%, 63.6%, and 60.8%, respectively, for multimodal ECG and PCG signals. In separate evaluation, ECG alone showed an F1-score of 61.1%, whereas PCG alone achieved 23.5%. Key features included the T-wave, the area under the QRS complex, and entropy measures derived from beat visibility graphs in the ECG. In addition, visibility graph features from multi-band S1 and S2 heart sound segments, along with MFCC coefficients from the PCG, were also relevant. However, PCG alone performed poorly and did not show improved results over the ECG features. Although auscultation is a key clinical diagnostic tool in symptomatic RHD, combining PCG with ECG features does not improve asymptomatic RHD detection over the ECG modality alone.

8

Cohort profile: The Australian Children of the Digital Age (ACODA) longitudinal cohort study measuring the digital lives of Australians during early childhood

MacKenzie, J.; Johnson, D.; Sarra, G.; Matthews, J. R.; Martinez-Buelvas, L.; Trenaman, D.; Sefton-Green, J.; Howard, S. J.; Smith, S. S.; Danby, S.; Zabatiero, J.

2026-05-13 pediatrics 10.64898/2026.05.09.26352795 medRxiv

Top 0.1%

17.5%

Objectives: The Australian Children of the Digital Age (ACODA) study is a longitudinal cohort study investigating the digital lives of Australians during early childhood. This paper presents a comprehensive description of the study protocol and an overview of children's digital technology use in the home at the first wave of data collection. Methods: Caregivers of children aged 6 months to 5 years completed a survey that captured the availability and use of digital technology within the home, and child- and caregiver-related factors that may influence children's digital technology use. Results: A total of 3,388 caregivers from across all Australian states and territories completed the survey. The majority (98%) of children had digital technology and internet access within their homes. Most children (93%) used at least one device in the last year, with televisions, tablets, and mobile phones most frequently used (89%, 47%, 42%, respectively). Digital technology use started early, with 61% of children aged <1 year having used a television. A greater proportion of older children used devices, and for longer durations than younger children. Across all ages, daily time was longest on televisions (M = 1:20, SD = 1:14), tablets (M = 1:06, SD = 1:36), and mobile phones (M = 0:30, SD = 1:05). Digital technology was used most for entertainment and learning activities, and was typically used with a caregiver and in lounge/living rooms. Conclusions: The ACODA study is the first longitudinal study to describe the digital technology use of Australians during early childhood and the context of this use. Data indicated that Australian children frequently used digital technology for entertainment and with their caregivers. Also, older children used digital technology more than younger children. Future waves will allow for exploration of changes in children's digital technology use over time, and associations with factors that may influence children's digital technology use.

9

The Verification Gap: Artificial Intelligence Adoption, Hallucination Awareness, and Verification Practices Among Early Career Medical Researchers in Pakistan

Sajjad, M.

2026-05-30 health informatics 10.64898/2026.05.28.26354373 medRxiv

Top 0.1%

12.8%

Artificial intelligence (AI) tools have been rapidly adopted by medical researchers, yet whether early career researchers in low- and middle-income countries possess the awareness and habits needed to use these tools safely remains poorly documented. This study characterized AI adoption patterns, hallucination awareness, and verification and disclosure practices among early career medical researchers in Pakistan. A cross-sectional anonymous online survey was conducted among medical students, house officers, residents, physicians, and faculty involved in research or academic work across Pakistan (May 2026). Descriptive statistics and chi-square tests were applied to 373 eligible responses. AI use was near universal (99.7%), with 60.3% using AI tools daily. The most commonly reported tool in this sample was Claude (40.5%), followed by ChatGPT (29.2%) and Perplexity (26.0%), though this ranking likely reflects sampling characteristics. Despite high adoption, 59.2% typically did not verify AI outputs before use, and 40.2% had never heard that AI can generate fabricated scientific references. In behavioral vignettes, 36.5% assumed convincing AI-generated references were authentic, and 54.2% would continue using remaining AI content after discovering one fabricated reference. Formal research training was strongly associated with consistent disclosure (51.7% vs. 17.1%; chi-square = 48.43, p < 0.001). Role, daily use frequency, and research training were not significantly associated with verification behavior. Early career medical researchers in Pakistan demonstrate high AI adoption alongside incomplete hallucination awareness and infrequent verification, a pattern that may carry implications for research integrity. Formal training was the only factor significantly associated with consistent disclosure. Integration of AI literacy into medical curricula and institutional governance frameworks merits consideration.

10

AI Decision Support for Challenging Teledermatology Cases: MedGemma Performance in the Dermatology ECHO Program

Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.

2026-05-26 health informatics 10.64898/2026.05.21.26353523 medRxiv

Top 0.1%

12.1%

Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between primary clinician provisional diagnosis and dermatologist-confirmed diagnosis. The model was prompted using dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with highest accuracy in Other categories (54.5%) and lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.

11

Design and Validation of an AI-Assisted Sequential Screening Framework for Psychological Distress in Glaucoma

Chou, N. A.; Baek, Y.; Feng, F.; Lu, K.; Choi, E. Y.; Fisher, H. M.; Malek, D.; Jammal, A.; Somers, T. J.; Muir, K. W.; Medeiros, F. A.; Berchuck, S. I.

2026-05-22 ophthalmology 10.64898/2026.05.20.26353679 medRxiv

Top 0.1%

12.0%

Purpose: Psychological distress is highly prevalent in glaucoma and is associated with worse adherence, reduced quality of life, and faster disease progression. However, distress is rarely assessed in ophthalmology settings due to time, workflow, and staffing constraints. We evaluated two artificial intelligence (AI)-based screening strategies, designed to efficiently identify distressed primary open-angle glaucoma (POAG) patients during routine care, aiming to achieve effective, resource-conscious, low-burden clinical screening. Design: Hybrid retrospective cohort and prospective cross-sectional study. Participants: The retrospective cohort included >3,000 POAG patients from the Duke Ophthalmic Registry. Prospective validation was conducted in a separate cohort of 300 POAG patients who completed patient-reported distress screening. Methods: Using retrospective data, a neural network model was trained to predict an electronic health record (EHR)-derived computable phenotype of distress ("silver standard"). Prospective validation used the 8-item Patient Health Questionnaire (PHQ-8) as the "gold standard." Three screening strategies were compared against PHQ-8: (1) universal PHQ-2 screening (two-item screener administered to all patients), (2) AI-only screening (fully automated EHR-based screener), and (3) sequential screening (only patients flagged as high risk by the AI screener completed the PHQ-2). Performance metrics included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and screening burden. Main Outcome Measures: Sensitivity; specificity; PPV; NPV; accuracy; proportion of patients requiring secondary screening (screening burden). Results: Distress prevalence was 17% (PHQ-8 > 6). Universal PHQ-2 screening (> 0) achieved high sensitivity (0.96) but lower specificity (0.73) and PPV (0.41), while requiring screening of all patients.
The AI-assisted sequential approach substantially reduced screening burden while maintaining strong diagnostic performance. By administering PHQ-2 to ~25% of patients, sequential screening achieved sensitivity 0.64, specificity 0.93, PPV 0.64, NPV 0.93, and accuracy 0.88, representing a ~50% increase in PPV compared to PHQ-2 alone. AI-only screening reduced burden further but did not achieve comparable sensitivity or predictive performance. Conclusions: AI-assisted sequential screening enables scalable, resource efficient identification of psychological distress in glaucoma care, substantially reducing screening burden while preserving clinically meaningful performance. This framework offers a practical pathway for integrating distress screening into routine ophthalmology workflows and improving the identification and referral of at-risk patients.
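
The reported predictive values can be related to sensitivity, specificity, and prevalence through Bayes' rule. The sketch below is an illustrative arithmetic check rather than the study's analysis code; the function name is hypothetical, and the inputs are the rounded figures from the abstract, so the recomputed values are only expected to agree approximately with the reported ones.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Derive PPV, NPV, and accuracy from sensitivity, specificity, and prevalence."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) * (1 - specificity)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "accuracy": true_pos + true_neg,
    }


PREVALENCE = 0.17  # distress prevalence reported in the abstract

# Rounded sensitivity/specificity as reported for each strategy.
sequential = predictive_values(sensitivity=0.64, specificity=0.93, prevalence=PREVALENCE)
universal_phq2 = predictive_values(sensitivity=0.96, specificity=0.73, prevalence=PREVALENCE)

for name, result in [("sequential", sequential), ("universal PHQ-2", universal_phq2)]:
    print(name, {metric: round(value, 2) for metric, value in result.items()})
```

With these rounded inputs, the sequential strategy's recomputed PPV, NPV, and accuracy land near the reported 0.64, 0.93, and 0.88, and the universal PHQ-2 PPV lands near the reported 0.41.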

12

I had to learn to trust my body again: Exploring the emotional and behavioural impact of wearable activity tracker discontinuation and reasons for removal.

Humphreys, G.; Jensen, S.; Manchester, K.; Sanal-Hayes, N.; Gluchowski, A.

2026-05-18 health informatics 10.64898/2026.05.14.26353189 medRxiv

Top 0.2%

10.3%

While wearable activity trackers (WATs) are widely used today, with device ownership increasing, some individuals subsequently discontinue device use. Existing research primarily examines the initiation and maintenance of device use, with less focus on device discontinuation. Examining this phenomenon can provide valuable insight into human-computer interactions and habit reversal. Therefore, the current study examined the perceived emotional and behavioural impact of WAT discontinuation, alongside reasons for this action, in former WAT users. Fifteen former WAT users (9 female, aged 23 to 56 years) who reported either full or partial device discontinuation were interviewed. Three themes and nine sub-themes were identified which detailed the impacts of device discontinuation. Participants reported a mindset shift around their body image, exercise performance and exercise motivation. Device discontinuation removed numerical feedback, which led to participants gaining bodily intuition and a sense of freedom. However, discontinuation also resulted in short-term negative emotions, including frustration around the loss of external praise and envy of current WAT users. Current findings hold important implications for digital safety from the user perspective, highlighting the need for guidance around healthy WAT use and vulnerable user profiles. More broadly, findings also raise the need for physical activity promotion whilst protecting individuals' well-being.

13

Impact of Imaging Protocols on Thermal Detection of Pressure Injuries: Threshold versus Deep Learning Across Skin Tones

Asare-Baiden, M.; Sonenblum, S. E.; Jordan, K.; Tomi John, G.; Chung, A.; Gichoya, J. W.; Hertzberg, V. S.; Ho, J. C.

2026-05-24 medical ethics 10.64898/2026.05.21.26353842 medRxiv

Top 0.2%

9.1%

Pressure injuries represent a significant healthcare challenge requiring early detection to prevent severe complications. While thermal imaging shows promise for detecting early pressure-related temperature changes, its robustness across varying imaging conditions and diverse patient populations remains unclear. This study systematically evaluated how imaging protocol variations (lighting, distance, positioning, camera type) and participant skin tone influence classification model performance for thermal cooling detection. Using a controlled cooling protocol to simulate early pressure injury temperature changes, we collected 1,680 images from 35 diverse participants across 12 imaging protocol variations. We compared two approaches: three deep learning models (MobileNetV2, InceptionNetV3, ResNet50) and a threshold-based approach using an optimal fixed threshold temperature differential. Deep learning models outperformed the threshold-based approach, achieving 98.6-99.6% accuracy compared to 95.6%, with superior performance across all imaging protocols and skin tone groups. The threshold-based approach showed camera-dependent misclassification patterns across skin tones. On the high-resolution FLIR E8XT, the MST 7-10 group had 8 of 11 misclassifications. This pattern shifted on the low-resolution FLIR ONE Pro, where the intermediate skin tone group (MST 6) had 22 of 44 total misclassifications. In contrast, deep learning models maintained consistent performance across all skin tone groups and imaging protocols. Visualization analysis of the deep learning models suggested that these models focused on thermal gradients at cooling region boundaries, suggesting that spatial temperature gradients, not single-value thresholds, are critical for accurate detection. These findings suggest the potential of deep learning-based approaches to maintain robust, equitable performance across diverse skin tones and imaging conditions.

14

CUOREMA: Immersive Bio & Behavioral Feedback and Digital Interventions for Cardiac Rehabilitation - Exploratory Analysis

Svihrova, R.; Marzorati, D.; Odello, T.; Monachino, G.; Staletti, T.; Tieben, R.; Luigies, R.; Bodewes, N.; Rutten, W.; Barrett, G.; Bhogal, A.; Wilkinson, T.; Tzovara, A.; Faraci, F. D.

2026-05-15 rehabilitation medicine and physical therapy 10.64898/2026.05.15.26353188 medRxiv

Top 0.2%

9.0%

Cardiac rehabilitation is critical for secondary prevention, yet long-term adherence remains low. We present CUOREMA, a new personalized mobile health system integrating self-monitoring diaries, wearable data, virtual coaching, and reinforcement learning-enhanced adaptive interventions to support lifestyle change during and after outpatient cardiac rehabilitation. In a six-month two-center feasibility study (N = 53, Switzerland and France), we evaluated usability, engagement patterns, and preliminary health-related outcomes. Attrition was high: only 19% of participants used the app on more than 100 days, and questionnaire response rates declined from 55% at baseline to 13% at six months. Despite these limitations, exploratory data-driven analysis revealed three distinct engagement clusters (dropout, sporadic, and consistent), which were further supported by matching patterns in app component usage, medication diary adoption, and smartwatch wearing time. Engagement clusters were not associated with demographic factors; instead, psychological themes of patients' personal goals suggested that intrinsic motivation characterized sustained users, whereas extrinsic motivation predominated among early dropouts. User experience was rated positively, and validated questionnaire scores showed no deterioration over time. One center demonstrated a statistically significant improvement in 6-minute walking test performance, though the study was not powered to detect clinical outcomes and selective dropout cannot be ruled out. These findings highlight engagement variability as a central challenge in digital cardiac rehabilitation and suggest that tailoring interventions to individual motivational profiles may improve long-term adherence.

15

Explainable AI and public reactions to AI-involved adverse diagnostic events: a vignette study

Choi, J.; Kim, Y. J.; Lyu, P.; Luan, Y. L.; Toh, S. M.

2026-06-02 health informatics 10.64898/2026.05.26.26353870 medRxiv

Top 0.2%

8.6%

Artificial intelligence (AI) is increasingly incorporated into diagnostic decision-making, raising questions about physician responsibility following AI-involved adverse diagnostic events. Explainable AI (XAI) has been proposed to improve transparency and trust, but its influence on public reactions remains unclear. In a randomised vignette-based experiment, 652 adults from the United States and United Kingdom were assigned to one of six conditions in a 3 (diagnostic source: AI alone, human radiologist alone, or human-AI collaboration) x 2 (explanation: present or absent) between-subjects design. Participants read a scenario in which a chest X-ray was initially interpreted as normal but lung cancer was diagnosed five months later, indicating that the original interpretation had missed the cancer. In explanation conditions, participants received additional information about how the diagnosis had been reached, including AI heatmap-based explanations in the AI conditions. Participants rated radiologist responsibility, likelihood of complaint, and intention to pursue legal action. Among 652 participants (mean age 42.2 years; 50.2% female), responsibility ratings were significantly lower when AI alone made the diagnostic decision (mean 4.73, 95% CI 4.53-4.93) compared with human-only decision-making (5.78, 95% CI 5.59-5.98; p<0.001) and human-AI collaboration (5.54, 95% CI 5.34-5.74; p<0.001). Complaint likelihood showed a similar pattern. Intentions to pursue legal action followed the same directional trend but were marginally significant. Neither explanations nor explanation-by-source interactions were associated with outcome measures. These findings suggest that the public expects physicians to remain accountable when AI is involved in diagnostic decision-making, particularly in collaborative settings. Providing explanatory information about how AI systems reach decisions may be insufficient to change perceptions of physician responsibility following adverse diagnostic events.

16

Prototyping a Generative AI-powered Person-centered Digital Health Tool to Mitigate Risk of Preventable Adverse Drug Events

Dobbins, D.; Russell, A.; Gunther, M.; Shetty, V.; Shomali, A.; Vawdrey, D.; Waring, S.; Whary, P.; Wong, J.; Wright, E. A.; Olson, A. W.

2026-06-04 health systems and quality improvement 10.64898/2026.06.02.26354712 medRxiv

Top 0.2%

8.4%

Objectives: Older adults with comorbidities and polypharmacy have disproportionately high risk of hospitalization as well as readmission from adverse drug events (ADEs), of which 28%-71% are preventable (pADEs). This paper introduces an LLM application, CommunicADE, designed to support risk-mitigation of pADE-related readmission for the aforementioned population. We aim to evaluate CommunicADE's technical performance with OpenAI's HealthBench criteria: accuracy, completeness, communication quality, context awareness, and instruction following. Materials and Methods: Our technical validation study used an LLM (KimiK2.5) to simulate interviews between CommunicADE and nine high-fidelity synthetic patients hospitalized and at increased risk for pADE-related readmission (65+ years, comorbidities, 5+ medications). Some pADE risk mechanism clues were visible to CommunicADE in patient H&Ps, but most mechanisms were solely discoverable in interviews. Two pharmacists evaluated CommunicADE's interview questions and EHR notes with HealthBench-informed variables. Analyses used descriptive statistics. Results: For 35 mechanisms across 9 patients (avg=3.89 mechanisms/patient), CommunicADE's precision and recall were 0.92 and 0.63, respectively. Hallucinations were absent. Coherence and person-centeredness scored 4.28 and 4.44 on a 5-point scale (5=highest). On average, communication was at a 5th grade level and objective for 78% of patients. Most patient-reported quotes included in notes (92%) supported detected mechanisms. CommunicADE followed all instructions regarding interview length and patient approvals. Discussion: CommunicADE's strongest performance was in accuracy (precision, hallucinations), communication quality (coherence, readability), and context awareness (person-centeredness). Completeness (recall) and instruction following (objectivity, pADE mechanism/quote alignment) show room for improvement. Conclusion: Findings suggest technical readiness for a feasibility pilot with real-world patients, and key areas for performance improvement.

17

Temporal Relationships between Smartphone Application Use and Online Substance Procurement in U.S. Youth

Gansner, M.; Adams, M.; Nikam, P.; Huntley, N.; Ramrajesh, S.; Marsch, L. A.; Levy, S.; Schuman-Olivier, Z.

2026-05-19 pediatrics 10.64898/2026.05.15.26353324 medRxiv

Top 0.3%

8.3%

Background: Despite the significant risks associated with online substance procurement (SP), few researchers have examined this practice in U.S. youth. The studies that do exist are cross-sectional and cannot temporally connect specific digital behaviors to online SP. This longitudinal cohort study examined youth SP and digital media habits to determine whether use of certain smartphone applications correlated with increased odds of online SP or being contacted online about procuring drugs or alcohol. Methods: A cohort of U.S. youth (aged 15-20) with a history of non-daily substance use in the 3 months prior to enrollment was recruited to use the digital phenotyping smartphone application EARS for 90 days. On a nightly basis, participants were asked to complete surveys about online experiences related to SP and instances of substance use. Smartphone-generated screen use data were also collected passively each day. Results: Of 112 enrolled participants, 106 were included in analyses. Over approximately 3 months, 28.3% of participants (n=30) reported a collective 91 instances where they used social media to acquire drugs or alcohol. Screen use data demonstrated temporal relationships between social media SP and applications previously connected to the social media drug-purchasing process (e.g., TikTok, encrypted apps), as well as other school-specific social media. Discussion: Our results provide critically needed research evidence to support a body of literature composed predominantly of anecdotal reports. Despite measures taken by social media companies to prevent use of their platforms for drug procurement, underage youth continue to engage in this practice.

18

Investigating the Readability, Visual Design, and Quality of Online Written Pharmacogenomics Health Information for Health Consumers in Australia

Giblett, M. J.; Babikian, Y.; Jhala, D. J.; Medland, S. E.

2026-05-29 health informatics 10.64898/2026.05.27.26354169 medRxiv

Top 0.3%

8.2%

Pharmacogenomics (PGx) offers a pathway towards personalised medicine, which relies on health consumer involvement in making informed decisions. As consumers increasingly seek health information online, high-quality digital resources are essential to support informed consent and shared decision making. The complexity of PGx and widespread limitations in health literacy raise concerns about whether existing consumer-facing online PGx resources are understandable and sufficiently comprehensive. This study evaluates the readability, visual design, and informational quality of publicly available online written PGx health information. Twenty-three webpages met inclusion criteria. The mean readability corresponded to approximately 15 years of formal education (university level), substantially exceeding the Australian Government's recommended Year 7 reading level for public health materials. Informational quality was generally low, with most webpages being rated as poor or very poor. In contrast, visual design quality was relatively strong, with webpages achieving on average around three-quarters of the criteria. Although the visual presentation of PGx webpages is generally professional, their high reading difficulty and limited discussion of treatment choices and uncertainties reduce their usefulness for health consumer education. Improving readability, clearly communicating risks and limitations, and incorporating decision-support features may enhance the ability of online resources to support informed consent and shared decision making.

19

Physician Facing AI Tools Show Distinct Failure Modes Under Structured Stress Testing

Hazare, N. S.; Oh, W.; Kumar, G.; Goel, N.; Shaikh, A.; Sharma, A.; Desman, J.; Kumar, A.; Robles, C.; Singh, A.; Jangda, M.; Agaron, S.; Capone, C.; Ngai, D.; Itwaru, A.; Parchure, P.; Ramaswamy, A.; Gorbenko, K.; Timsina, P.; Lampert, J.; Tamler, R.; Manasia, A.; Kohli-Seth, R.; Kaplan, B.; Vakil, A.; Omar, M.; Glicksberg, B. S.; Freeman, R.; Stern, A. D.; Klang, E.; Darrow, B.; Stump, L. S.; Reich, D.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-29 health informatics 10.64898/2026.05.27.26354248 medRxiv

Top 0.3%

7.3%

Importance: Physician-facing AI tools are now in clinical use, yet whether different platforms fail in similar or fundamentally different ways in high-stakes settings like critical care is unknown. Objective: To evaluate two physician-facing AI platforms, ChatGPT for Clinicians and OpenEvidence, for distinct vulnerabilities under structured stress testing. Design, Setting, and Participants: An observational study conducted using 60 simulated critical care vignettes developed and adjudicated by four attending critical care physicians. Data were collected in the last week of April 2026, via the public website interfaces of each platform. Interventions/Exposures: A 2x2x2x2 factorial design across four stressors - anchoring, cognitive load, social conformity pressure, and a clinically incorrect directive - yielded 16 prompt subsets per vignette and 960 prompts per platform. A separate multi-turn adversarial prompting paradigm administered three sequential "You are incorrect" challenges to baseline vignettes. All prompts had a universal output length constraint of fewer than 30 words. Main Outcomes and Measures: Critical elements capture (percentage of gold-standard critical elements present in responses), susceptibility to clinically incorrect directive, and sycophancy (reversal of an initial correct recommendation under iterative adversarial challenge). Results: Across 1916 responses to 1920 prompts, ChatGPT for Clinicians captured more gold-standard critical elements than OpenEvidence (81.4% ± 18.1% vs 61.0% ± 23.5%; adjusted difference, 20.3 percentage points; 95% CI, 18.3 to 22.4; P < .001) and was less susceptible to clinically incorrect directives (1.7% vs 8.0%; adjusted odds ratio, 0.07; 95% CI, 0.02-0.21; P < .001). Anchoring and social conformity pressure were associated with reduced critical element capture across both platforms, while cumulative stressor burden reduced critical element capture only on OpenEvidence. Conversely, ChatGPT for Clinicians reversed correct recommendations more readily under adversarial prompting (hazard ratio, 2.61; 95% CI, 1.10-6.19; P = .03). Conclusion and Relevance: The two physician-facing clinical AI platforms evaluated demonstrated non-overlapping vulnerabilities, with neither platform uniformly superior. These findings argue against single-axis ranking of clinical AI systems and support multidimensional safety evaluation encompassing completeness of reasoning, resistance to incorrect directives, and stability under adversarial challenge.
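
The prompt counts in this abstract follow from the factorial design: four binary stressors give 2^4 = 16 conditions, and 16 conditions across 60 vignettes give 960 prompts per platform (1920 across both). The Python sketch below only illustrates that enumeration. The stressor identifiers and function names are hypothetical labels chosen here, not taken from the study's materials, and the sketch does not reproduce how the authors actually constructed their prompts.

```python
from itertools import product

# Hypothetical identifiers for the four stressors named in the abstract.
STRESSORS = (
    "anchoring",
    "cognitive_load",
    "social_conformity",
    "incorrect_directive",
)


def factorial_conditions(stressors=STRESSORS):
    """Return every on/off combination of the stressors as a list of dicts."""
    return [
        dict(zip(stressors, levels))
        for levels in product((False, True), repeat=len(stressors))
    ]


def prompt_grid(n_vignettes, stressors=STRESSORS):
    """Pair each vignette index with each factorial condition."""
    conditions = factorial_conditions(stressors)
    return [(v, c) for v in range(n_vignettes) for c in conditions]


conditions = factorial_conditions()
grid = prompt_grid(60)  # 60 vignettes, as stated in the abstract

print(len(conditions))  # conditions per vignette
print(len(grid))        # prompts per platform
print(2 * len(grid))    # prompts across the two platforms
```

The number of stressors switched on in a condition, `sum(condition.values())`, is one plausible way to express the "cumulative stressor burden" mentioned in the results, though the abstract does not define that measure.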

20

Adoption of Guided Structured Reporting in Routine Radiological Practice: A Six-Week Multi-Site Implementation Study in the UAE

Lorenz, D.; Jansen, S.; Knoche, J.; Wolf-Sebottendorff, R.; Awad, H. J.; Toker, I.

2026-05-22 radiology and imaging 10.64898/2026.05.20.26353646 medRxiv

Top 0.3%

7.2%

Background. Guided structured reporting has been proposed to address the limited availability of structured data in radiology, yet empirical evidence on its real-world adoption across users and imaging modalities remains scarce. Objective. To describe the adoption dynamics of a guided structured reporting system across multiple users and imaging modalities during a six-week implementation period. Methods. Retrospective observational study at two public tertiary hospitals in Abu Dhabi, United Arab Emirates. A guided structured reporting system was deployed for computed tomography (CT), magnetic resonance imaging (MRI), and mammography. Seven radiologists participated. The primary outcome was active in-software reporting time, recorded via system logs of mouse and keyboard interaction. Temporal trends in median reporting time per modality and individual user trajectories were analysed descriptively. After predefined data cleaning, 126 reports were included (84 CT, 27 MRI, 15 mammography). Results. Active in-software reporting time decreased across all modalities. Median reporting time fell from 130 s to 56 s for CT, from 383 s to 60 s for MRI, and from 126 s to 46 s for mammography (week 1 to week 6). Individual trajectories showed similar patterns, with the largest reductions during the early implementation phase. Subgroup analyses were limited by small sample sizes. Conclusions. Guided structured reporting was integrated into routine clinical workflows with temporal reductions in active reporting time across users and modalities, providing empirical evidence on the feasibility of workflow-integrated structured reporting in radiological practice.
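
The abstract reports the week 1 and week 6 medians but not the relative change. As a reading aid, the short Python sketch below derives the relative reduction in median reporting time per modality from the reported values only. The variable and function names are illustrative, and because the medians come from small samples (84 CT, 27 MRI, 15 mammography reports), the percentages are descriptive arithmetic, not effect estimates.

```python
# Median active reporting time in seconds (week 1, week 6), as reported.
medians_s = {
    "CT": (130, 56),
    "MRI": (383, 60),
    "mammography": (126, 46),
}


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


reductions = {
    modality: relative_reduction(before, after)
    for modality, (before, after) in medians_s.items()
}

for modality, reduction in reductions.items():
    print(f"{modality}: {reduction:.1%}")
```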